11 Randomness and Complexity

In anticipation of the following sections, we can already note that incompressibility (i.e., the total absence of regularities) forms a criterion of randomness. This criterion uses the notion of algorithmic complexity. The first sequence can be generated by the brief instruction “write ‘1’ 32 times” and the second by the only marginally longer statement “write ‘01’ 16 times”, whereas the third, which was generated by blindly tapping on a keyboard, has no apparent regularity.
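This contrast can be made concrete with an off-the-shelf compressor as a rough proxy for algorithmic complexity. The sketch below is heuristic only: zlib’s output length is merely an upper bound on complexity, and the three sequences are illustrative stand-ins (longer than the 32-symbol examples, so that the compressor’s fixed overhead does not dominate).

```python
import random
import zlib

# Illustrative stand-ins for the three kinds of sequence in the text.
constant = "1" * 1024                      # "write '1' 1024 times"
alternating = "01" * 512                   # "write '01' 512 times"
random.seed(0)                             # fixed seed, for reproducibility
haphazard = "".join(random.choice("01") for _ in range(1024))

# Compressed size serves as a crude upper bound on algorithmic complexity.
sizes = {name: len(zlib.compress(s.encode()))
         for name, s in [("constant", constant),
                         ("alternating", alternating),
                         ("haphazard", haphazard)]}

# The two regular sequences compress far better than the haphazard one.
print(sizes)
```

Running this shows the constant and alternating strings shrinking to a handful of bytes, while the haphazard string resists compression.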

“Absence of pattern” corresponds to the dictionary synonym “haphazard” (cf. the French expression “au hasard”). By counting the number of 1s and 0s in a long segment of the third sequence, we can obtain an estimate of the probability of occurrence of each symbol. “Haphazard” then means that the choice of each successive symbol is made independently, without reference to the preceding symbol or symbols, in sharp contrast to the second sequence, which could also be generated by the algorithm “if the preceding symbol is 1, write 0, otherwise write 1” operating on a starting seed of 1.
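That one-line rule can be written out directly; the following is a minimal sketch (the function name and the choice to exclude the seed from the output are illustrative interpretations of the text):

```python
def alternate(seed: str, length: int) -> str:
    """Emit `length` symbols by the rule: if the preceding symbol is 1,
    write 0, otherwise write 1.  The seed itself is not emitted."""
    out = []
    prev = seed
    for _ in range(length):
        prev = "0" if prev == "1" else "1"
        out.append(prev)
    return "".join(out)

print(alternate("1", 32))  # reproduces the second sequence: '01' 16 times
```

Starting from the seed 1, the first emitted symbol is 0, the next 1, and so on, yielding exactly “write ‘01’ 16 times”.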

Note how closely this exercise of algorithmic compression is related to the general aim of science: to find the simplest set of axioms that will enable all the observable phenomena studied by the branch of science concerned to be explained (an empirical fact being “explained” if the propositions expressing it can be shown to be a consequence of the axioms constituting the scientific theory underpinning that branch). For example, Maxwell’s equations turned out to be suitable for explaining the phenomena of electromagnetism.¹

The meaning of randomness as denoting independence from what has gone before is well captured in the familiar expression “random access memory”, the significance being that a memory location can be selected arbitrarily (cf. the German “beliebig”, at whim), as opposed to a sequential access memory, whose elements can only be accessed one after the other. Mention of memory brings to mind the fact that successive independent choices imply the absence of memory in the process generating those choices.

The validity of the above is independent of the actual probabilities of choosing symbols; that is, they may be equal or unequal. Although in many organisms it turns out that the frequencies of occurrence of all four bases are in fact equal, this is by no means universal, it being well known that thermophilic bacteria have more C≡G base pairs than A=T in their genes, since the former, being linked by three hydrogen bonds, are more thermally stable than the latter, which only have two (cf. Fig. 15.3).
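The compositional bias in question is easy to quantify: the GC content of a sequence is simply the fraction of its bases that are G or C. A small sketch, using made-up fragments rather than real genomic data:

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Hypothetical fragments for illustration only, not real genomic data.
at_rich = "ATATGCATTAAT"    # 2 of 12 bases are G or C
gc_rich = "GCGGCCATGCGC"    # 10 of 12 bases are G or C
print(gc_content(at_rich), gc_content(gc_rich))
```

A thermophile’s genes would show a figure closer to the second fragment than the first.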

Yet, we can still speak of randomness in this case. In binary terms, it corresponds to unequal probabilities of heads or tails, and the sequence may still be algorithmically

¹ An obvious corollary of this association of randomness with algorithmic compressibility is that there is an intrinsic absurdity in the notion of an algorithm for generating random numbers, such as those included with many compilers and other software packages. These computer-generated pseudorandom numbers generally pass the usual statistical tests for randomness, but little is known about how their nonrandomness affects results obtained using them. Quite possibly the best heuristic sources of (pseudo)random digits are the successive digits of irrational numbers like π or √2. These can be generated by a deterministic algorithm and, of course, are always the same, but in the sense that one cannot jump to (say) the hundredth digit without computing those preceding it, they do fulfil the criteria of haphazardness.
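The sequential character of such digit sources can be sketched with integer square roots: computing the first n decimal digits of √2 amounts to computing the whole prefix at once, with no commonly used shortcut to the nth digit alone (the function name below is illustrative):

```python
import math

def sqrt2_digits(n: int) -> str:
    """First n decimal digits of sqrt(2), including the leading 1,
    via the exact integer square root of 2 * 10**(2*(n-1))."""
    return str(math.isqrt(2 * 10 ** (2 * (n - 1))))

print(sqrt2_digits(10))  # 1414213562
```

Because `math.isqrt` is exact and √2 is irrational, the n-digit prefix is always correct; but obtaining digit one hundred requires working with an integer carrying all the preceding digits.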